46 research outputs found

    Evaluation Metrics for Unsupervised Learning Algorithms

    Determining the quality of the results obtained by clustering techniques is a key issue in unsupervised machine learning. Many authors have discussed the desirable features of good clustering algorithms. However, Jon Kleinberg established an impossibility theorem for clustering. As a consequence, a wealth of studies have proposed techniques to evaluate the quality of clustering results depending on the characteristics of the clustering problem and the algorithmic technique employed to cluster data. Comment: Technical Report

    A Tool for Model-Based Language Specification

    Formal languages let us define the textual representation of data with precision. Formal grammars, typically in the form of BNF-like productions, describe the language syntax, which is then annotated for syntax-directed translation and completed with semantic actions. When, apart from the textual representation of data, an explicit representation of the corresponding data structure is required, the language designer has to devise the mapping between a suitable data model and its language specification, and then develop the conversion procedure from the parse tree to the data model instance. Unfortunately, whenever the format of the textual representation has to be modified, changes have to be propagated throughout the entire language processor tool chain. These updates are time-consuming, tedious, and error-prone. Besides, when different applications use the same language, several copies of the same language specification have to be maintained. In this paper, we introduce a model-based parser generator that decouples language specification from language processing, hence avoiding many of the problems caused by grammar-driven parsers and parser generators.

    Scanning and Parsing Languages with Ambiguities and Constraints: The Lamb and Fence Algorithms

    Traditional language processing tools constrain language designers to specific kinds of grammars. In contrast, model-based language processing tools decouple language design from language processing. These tools allow the occurrence of lexical and syntactic ambiguities in language specifications and the declarative specification of constraints for resolving them. As a result, these techniques require scanners and parsers able to parse context-free grammars, handle ambiguities, and enforce constraints for disambiguation. In this paper, we present Lamb and Fence. Lamb is a scanning algorithm that supports ambiguous token definitions and the specification of custom pattern matchers and constraints. Fence is a chart parsing algorithm that supports ambiguous context-free grammars and the definition of constraints on associativity, composition, and precedence, as well as custom constraints. Lamb and Fence, in conjunction, enable the implementation of the ModelCC model-based language processing tool. Comment: arXiv admin note: text overlap with arXiv:1111.3970, arXiv:1110.147

    A DSL for Mapping Abstract Syntax Models to Concrete Syntax Models in ModelCC

    ModelCC is a model-based parser generator that decouples language design from language processing. ModelCC provides two different mechanisms to specify the mapping from an abstract syntax model (ASM) to a concrete syntax model (CSM): metadata annotations defined on top of the abstract syntax model specification and a domain-specific language for defining ASM-CSM mappings. Using a domain-specific language to specify the mapping from abstract to concrete syntax models allows the definition of multiple concrete syntax models for the same abstract syntax model. In this paper, we describe the ModelCC domain-specific language for abstract syntax model to concrete syntax model mappings and we showcase its capabilities by providing a meta-definition of that domain-specific language. Comment: arXiv admin note: substantial text overlap with arXiv:1202.659

    The ModelCC Model-Based Parser Generator

    Formal languages let us define the textual representation of data with precision. Formal grammars, typically in the form of BNF-like productions, describe the language syntax, which is then annotated for syntax-directed translation and completed with semantic actions. When, apart from the textual representation of data, an explicit representation of the corresponding data structure is required, the language designer has to devise the mapping between a suitable data model and its language specification, and then develop the conversion procedure from the parse tree to the data model instance. Unfortunately, whenever the format of the textual representation has to be modified, changes have to be propagated throughout the entire language processor tool chain. These updates are time-consuming, tedious, and error-prone. Besides, when different applications use the same language, several copies of the same language specification have to be maintained. In this paper, we introduce ModelCC, a model-based parser generator that decouples language specification from language processing, hence avoiding many of the problems caused by grammar-driven parsers and parser generators. ModelCC incorporates reference resolution within the parsing process. Therefore, instead of returning mere abstract syntax trees, ModelCC is able to obtain abstract syntax graphs from input strings. Comment: arXiv admin note: substantial text overlap with arXiv:1111.3970, arXiv:1501.0203

    A Model-Driven Probabilistic Parser Generator

    Existing probabilistic scanners and parsers impose hard constraints on the way lexical and syntactic ambiguities can be resolved. Furthermore, traditional grammar-based parsing tools are limited in the mechanisms they allow for taking context into account. In this paper, we propose a model-driven tool that allows for statistical language models with arbitrary probability estimators. Our work on model-driven probabilistic parsing is built on top of ModelCC, a model-based parser generator, and enables the probabilistic interpretation and resolution of anaphoric, cataphoric, and recursive references in the disambiguation of abstract syntax graphs. In order to demonstrate the expressive power of ModelCC, we describe the design of a general-purpose natural language parser.

    A Lexical Analysis Tool with Ambiguity Support

    Lexical ambiguities naturally arise in languages. We present Lamb, a lexical analyzer that produces a lexical analysis graph describing all the possible sequences of tokens that can be found within the input string. Parsers can process such lexical analysis graphs and discard any sequence of tokens that does not produce a valid syntactic sentence, therefore performing, together with Lamb, a context-sensitive lexical analysis in lexically ambiguous language specifications.
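
    The notion of a lexical analysis graph can be illustrated with a minimal sketch. It is not Lamb's algorithm: the token names and patterns below are hypothetical, every match of every pattern becomes an edge (start, end, type), and the token sequences are simply the paths that cover the whole input. No filtering (such as longest-match rules or constraints) is attempted.

```python
import re

# Hypothetical token definitions; they overlap on purpose to create ambiguity.
TOKEN_PATTERNS = {
    "INTEGER": r"\d+",
    "REAL": r"\d+\.\d+",
    "DOT": r"\.",
}


def token_edges(text):
    """Return every (start, end, type) span that fully matches some pattern."""
    edges = []
    for start in range(len(text)):
        for name, pattern in TOKEN_PATTERNS.items():
            regex = re.compile(pattern)
            # Try every possible end so that shorter matches are kept as well.
            for end in range(start + 1, len(text) + 1):
                if regex.fullmatch(text, start, end):
                    edges.append((start, end, name))
    return edges


def token_sequences(text):
    """Enumerate all token sequences (graph paths) that cover the whole input."""
    edges = token_edges(text)

    def walk(pos):
        if pos == len(text):
            yield []
            return
        for start, end, name in edges:
            if start == pos:
                for rest in walk(end):
                    yield [(name, text[start:end])] + rest

    return list(walk(0))
```

    For the input "1.5", this sketch yields two readings: a single REAL token, and the sequence INTEGER, DOT, INTEGER. A parser consuming the graph would then keep only the readings that lead to a valid sentence.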

    Treating Insomnia, Amnesia, and Acalculia in Regular Expression Matching

    Regular expressions provide a flexible means for matching strings and they are often used in data-intensive applications. They are formally equivalent to either deterministic finite automata (DFAs) or nondeterministic finite automata (NFAs). Both DFAs and NFAs are affected by two problems known as amnesia and acalculia, and DFAs are also affected by a problem known as insomnia. Existing techniques require an automata conversion and compaction step that prevents the use of existing automaton databases and hinders the maintenance of the resulting compact automata. In this paper, we propose Parallel Finite State Machines (PFSMs), which are able to run any DFA- or NFA-like state machine without a previous conversion or compaction step. PFSMs report, online, all the matches found within an input string and they solve the three aforementioned problems. Parallel Finite State Machines require quadratic time and linear memory and they are distributable. Parallel Finite State Machines make very fast distributed regular expression matching in data-intensive applications feasible.
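
    To make "reporting all matches online, without a conversion step" concrete, here is a minimal sketch of the textbook state-set simulation of an automaton. It is not the PFSM algorithm from the paper: the function name and the transition-table format are assumptions made for illustration. A new run is started at every input position, and each run remembers where it began so that a match can be reported as soon as an accepting state is reached.

```python
def find_all_matches(transitions, start_state, accepting, text):
    """Report every (start, end) span of text accepted by the automaton.

    transitions maps (state, symbol) to a set of next states, so the same
    table format serves both DFA-like and NFA-like machines.
    """
    matches = []
    active = set()  # pairs of (current state, index where the run began)
    for i, symbol in enumerate(text):
        active.add((start_state, i))  # a new run may begin at every position
        next_active = set()
        found = set()
        for state, begin in active:
            for target in transitions.get((state, symbol), ()):
                next_active.add((target, begin))
                if target in accepting:
                    found.add((begin, i + 1))
        matches.extend(sorted(found))  # reported as soon as symbol i is read
        active = next_active
    return matches
```

    With a hypothetical three-state automaton for the expression ab+ (0 to 1 on "a", 1 to 2 on "b", 2 to 2 on "b", state 2 accepting), the input "abbab" yields the spans (0, 2), (0, 3), and (3, 5).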

    A Model-Driven Parser Generator, from Abstract Syntax Trees to Abstract Syntax Graphs

    Model-based parser generators decouple language specification from language processing. The model-driven approach avoids the limitations that conventional parser generators impose on the language designer. Conventional tools require the designed language grammar to conform to the specific kind of grammar supported by the particular parser generator (LL and LR parser generators being the most common). Model-driven parser generators, like ModelCC, do not require a grammar specification, since that grammar can be automatically derived from the language model and, if needed, adapted to conform to the requirements of the given kind of parser, all of this without interfering with the conceptual design of the language and its associated applications. Moreover, model-driven tools such as ModelCC are able to automatically resolve references between language elements, hence producing abstract syntax graphs instead of abstract syntax trees as the result of the parsing process. Such graphs are not confined to directed acyclic graphs and they can contain cycles, since ModelCC supports anaphoric, cataphoric, and recursive references.

    An Automorphic Distance Metric and its Application to Node Embedding for Role Mining

    Role is a fundamental concept in the analysis of the behavior and function of interacting entities represented by network data. Role discovery is the task of uncovering hidden roles. Node roles are commonly defined in terms of equivalence classes, where two nodes have the same role if they fall within the same equivalence class. Automorphic equivalence, where two nodes are equivalent when they can swap their labels to form an isomorphic graph, captures this common notion of role. The binary concept of equivalence is too restrictive and nodes in real-world networks rarely belong to the same equivalence class. Instead, a relaxed definition in terms of similarity or distance is commonly used to compute the degree to which two nodes are equivalent. In this paper, we propose a novel distance metric called automorphic distance, which measures how far two nodes are from being automorphically equivalent. We also study its application to node embedding, showing how our metric can be used to generate vector representations of nodes preserving their roles for data visualization and machine learning. Our experiments confirm that the proposed metric outperforms the RoleSim automorphic equivalence-based metric in the generation of node embeddings for different networks.
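
    The binary notion of automorphic equivalence mentioned above can be checked by brute force on very small graphs. The sketch below is only an illustration of that definition, not the proposed automorphic distance: it assumes an undirected graph without self-loops, uses a hypothetical function name, and tries every permutation of the nodes, so it is impractical beyond a handful of nodes.

```python
from itertools import permutations


def automorphically_equivalent(nodes, edges, u, v):
    """Return True if some automorphism of the undirected graph maps u to v.

    Assumes no self-loops; every edge is a pair of distinct nodes.
    """
    edge_set = {frozenset(edge) for edge in edges}
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        if mapping[u] != v:
            continue
        # The relabeling is an automorphism if it maps the edge set onto itself.
        relabeled = {frozenset((mapping[a], mapping[b])) for a, b in edges}
        if relabeled == edge_set:
            return True
    return False
```

    On the path graph a-b-c, the two endpoints a and c are automorphically equivalent, whereas a and the middle node b are not. A distance-based relaxation would instead quantify how close such a pair is to being equivalent.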